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Abstract 

The  relationship  between  ‘learning’  in  adaptive  layered  networks  and  the  fitting  of 
data  with  high  dimensional  surfaces  is  discussed.  This  leads  naturally  to  a  picture 
of  ‘generalisation’  in  terms  of  interpolation  between  known  data  points  and  suggests 
a  rational  approach  to  the  theory  of  such  networks.  A  class  of  adaptive  networks  is 
identified  which  makes  the  interpolation  scheme  explicit.  This  class  has  the  property 
that  learning  is  equivalent  to  the  solution  of  a  set  of  linear  equations.  These  networks 
thus  represent  nonlinear  relationships  while  having  a  guaranteed  learning  rule. 
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1  Introduction 


The  strong  resurgence  of  interest  in  ‘self-learning  machines’,  whose  ancestors  include  the 
perceptrons  of  the  1960’s,  is  at  least  partly  driven  by  the  expectation  that  they  will  provide  a 
new  source  of  algorithms/architectures  for  the  processing  of  complex  data.  This  expectation 
is  based  on  an  analogy  with  connectionist  models  of  animal  brains.  The  brain  is  able  to  cope 
with  sophisticated  recognition  and  inductive  tasks  far  beyond  the  capabilities  of  systems 
based  on  present  computing  logic  and  architectures. 

An  example  of  an  everyday  task  for  the  human  brain  which  illustrates  this  point,  is 
the  recognition  of  human  speech.  For  automatic  speech  recognition  one  wishes  to  deduce 
properties  (the  implied  message)  from  the  statistics  of  a  finite  input  data  set  even  though 
the  speech  will  be  subject  to  nonlinear  distortions  due  to  noise,  sex,  age,  health  and  accent 
of  the  speaker.  This  has  proved  to  be  notoriously  difficult.  Over  the  past  decade,  one 
of  the  most  successful  techniques  developed  for  speech  recognition  has  been  based  around 
Hidden  Markov  Models  (see  for  instance  [1,2,3]).  In  this  scheme,  speech  is  modelled  as  a 
sequence  of  causal  stationary  stochastic  processes  determined  by  a  finite  number  of  allowable 
states,  state  transitions,  a  matrix  of  stationary  transition  probabilities  and  an  initial  state 
distribution.  There  is,  in  addition,  a  learning  algorithm  (Baum- Welch  iteration)  to  fine 
tune  the  parameters  of  the  model  which  increases  the  likelihood  that  each  model  produces 
it’s  associated  data.  However,  one  of  the  problems  with  this  approach  involves  the  a  priori 
assumptions  made  regarding  the  topology  of  the  Hidden  Markov  model  (number  of  states, 
allowable  transitions  etc).  As  long  as  the  total  possible  input  data  is  consistent  with  the 
assumptions  made  in  the  original  model,  then  one  can  expect  a  faithful  representation 
of  the  data.  This  unfortunately  presupposes  that  we  already  know  how  to  model  speech 
adequately. 

In  an  attempt  to  circumvent  this  problem,  self-learning  machines  have  been  employed. 
The  virtue  of  these  is  that  no  explicit  model  of  speech  is  required.  For  instance,  the  multi¬ 
layer  perceptron  (which  is  a  layered  nonlinear  network)  embodies  a  set  of  nonlinear,  ‘hidden’ 
units  whose  task  is  to  encode  the  higher  order  constraints  of  the  input  data  [4].  This  is 
achieved  by  varying  weights  governing  the  strengths  of  connections  between  the  units  in 
order  to  minimise  the  error  in  relating  known  input-output  pairs  (the  ‘training’  set).  This 
process  has  become  known  as  “learning” .  The  ability  of  the  network  to  give  subsequently 
reasonable  (in  some  sense)  outputs  for  inputs  not  contained  in  the  training  set  is  termed 
“generalisation”.  In  efTect  a  multi-layer  perceptron  of  given  geometry  with  given  nonlinear 
responses  of  the  units  constitutes  an  M-parameter  family  of  models  (where  M  is  the  total 
number  of  weights  which  are  varied).  It  is  currently  an  act  of  faith,  based  on  encouraging 
practical  results,  that  such  families  are  broad  enough  to  include  models  of  speech  which  are 
adequate  for  the  purposes  of  classification.  This  means,  however,  that  the  design  of  a  multi¬ 
layer  perceptron  for  a  specific  task  remains  an  empirical  art.  In  particular,  how  many  hidden 
units  should  b?  employed  and  in  what  configuration,  how  much  training  data  is  needed, 
and  what  initial  inter-connection  strengths  ba^o  to  be  assigned7  Some  experimental  work 
has  been  performed  which  addresses  these  problems  (for  instance  [5,6])  although  it  is  fair  to 
comment  that  an  understanding  of  these  issues  is  still  lacking.  The  source  of  the  difficulty 
is  the  implicit  relationship  between  these  externally  controllable  factors,  and  the  model 
ultimately  represented  by  the  network. 

The  present  paper  investigates  the  implicit  assumptions  made  when  employing  a  feed¬ 
forward  layered  network  model  to  analyse  complex  data.  The  approach  will  be  to  view  such 
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a  network  as  representing  a  map  from  an  n-dimensional  input  space  to  an  n' -dimensional 
output  space,  say  a  :  Rn  — *  Rn’ .  This  map  will  be  thought  of  as  a  graph  T  C  Rn  x  Rn’  (in  the 
same  way  that  a  :  R  — *  R,  where  a(x)  =  x3  may  be  thought  of  as  a  parabola  drawn  in  RJ). 
From  this  point  of  view,  “error  free”  training  data  presented  to  the  network  in  the  form  of 
input-output  pairs  represent  points  on  the  graph,  T  and  the  learning  phase  of  an  adaptive 
network  constitutes  the  optimisation  for  a  fitting  procedure  for  T  based  on  the  known 
data  points.  Generalisation  is  therefore  synonymous  with  interpolation  between  the  data 
points  with  the  interpolation  being  along  the  constrained  surface  generated  by  the  fitting 
procedure  as  the  optimum  approximation  to  T.  In  this  picture  the  implicit  assumptions 
made  when  using  a  multi-layer  perceptron  concern  the  nature  of  the  fitting  procedure,  and 
clearly  relate  directly  to  the  way  in  which  the  network  generalises.  Thus,  we  are  led  to  the 
theory  of  multi-variable  interpolation  in  high  dimensional  spaces.  In  subsequent  sections, 
we  shall  exploit  some  of  the  mathematics  of  this  expanding  field  of  research  to  develop  a 
new  type  of  layered  network  model  in  which  the  nature  of  the  fitting  procedure  is  explicit. 
This  class  of  layered  network  model  will  be  shown  to  be  of  considerable  interest  in  itself.  In 
addition  however,  it  is  hoped  that  the  explicit  nature  of  the  fitting  procedure  will  allow  us 
to  develop  a  better  understanding  of  the  general  properties  of  layered  nonlinear  networks 
which  perform  an  equivalent  function. 


2  Multi-variable  functional  interpolation 
using  radial  basis  functions 


This  section  introduces  briefly  the  method  of  Radial  Boats  Functions ,  a  technique  for  in¬ 
terpolating  in  a  high  dimensional  space  which  has  recently  seen  important  developments. 
Further  details  may  be  obtained  from  the  review  article  of  Powell  |7]  and  the  important 
contribution  of  Micchelli  [8], 

In  the  cited  references  the  radial  basis  function  approach  is  applied  to  the  strict  inter¬ 
polation  problem  which  may  be  summarised  as  follows:- 

Problem:  Given  a  set  of  m  distinct  vectors  (data  points),  {x,;  i  =  1,2, . .  .  ,m}  in  Rn  and 
m  real  numbers  { /, ; *  =  1,2, choose  a  function  a  :  R"  — >  R  which  satisfies  the 
interpolation  conditions 

*(*.)  =  /.  *  =  1, 2, ....  m  (1) 

Note  that  the  function,  s,  is  constrained  to  go  through  the  known  data  points. 

There  are  clearly  many  criteria  one  could  impose  which  would  restrict  the  possible 
functional  form  of  s(x)  (see  for  instance  (9j).  The  Radial  Basis  Function  approach  constructs 
a  linear  space  which  depends  on  the  position  relative  to  the  known  data  points  according 
to  an  arbitrary  distance  measure.  Thus,  a  set  of  m  arbitrary  (generally  nonlinear)  ‘basis' 
functions  4>{\\x  —  jJJ)  is  introduced,  where  x  €  Rn  and  || . . .  ||  denotes  a  norm  imposed  on 
R"  which  is  usually  taken  to  be  Euclidean.  The  vectors  yt  €  Rn  ,«  =  l,2,...,m  are  the 
centres  of  the  basis  functions  and  taken  to  be  sample  data  points.  In  terms  of  these  basis 
functions,  we  consider  interpolating  functions  of  the  form:- 

m 

«U)  =  52  Mdls  -  ]Lj I!)  £  e  Rn  (2) 
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Inserting  the  interpolation  conditions,  equation  (1)  into  equation  (2),  gives  the  following 
set  of  linear  equations  for  the  coefficients  {Ay}, 


where 

Aij  =  -gyll)  *,J  =  (4) 

Given  the  existence  of  the  inverse  to  matrix  A  with  elements  A ,y,  equation  (3)  uniquely 
defines  the  coefficients  Ay  through  A  =  A-1/- 

In  general,  one  might  expect  that  for  an  arbitrary  choice  of  <f>  the  matrix  A  could  be 
singular.  However,  the  results  of  Micchelli  prove  that  for  all  positive  integers  m,n  and  for 
a  large  class  of  functions  <j>,  the  matrix  A  is  non-singular  if  the  data  points  are  all  distinct  . 

This  discussion  readily  generalises  to  maps  a  :  Rn  — *  R"\  In  this  case  the  m  distinct 
data  points  in  R"  are  associated  with  m  vectors  e  Rn  .  The  interpolation  condition  of 
equation  (1)  thus  generalises  to 

sk(x*)  =  f>k  i  =  1,2,  ...,m  k=  1,2, (5) 

which  leads  to  interpolating  functions  of  the  form 

s*(x)  =  f;Ay^(||x-1,.||)  I6R"  k  =  1, 2, ....  n’  (6) 

j  =  i 

The  expansion  coefficients  Ay*  are  obtained  using  the  inverse  of  the  same  matrix  A  defined 
in  equation  (4). 

Once  a  suitable  choice  of  the  function  4>  is  made,  and  a  convenient  distance  measure  im¬ 
posed,  the  above  relations  exactly  specify  the  interpolation  problem  which  has  a  guaranteed 
solution. 

However,  for  certain  classes  of  problem,  the  above  analysis  may  not  be  a  good  strategy 
for  the  following  reason.  A  basic  consideration  when  fitting  data  is  the  number  of  degrees  of 
freedom  required.  That  is,  the  minimum  number  of  basis  functions  needed  to  generate  an 
acceptable  fit  to  the  data.  In  the  situation  where  the  number  of  data  points  far  exceeds  the 
number  of  degrees  of  freedom  there  will  be  redundancy  since  we  are  constrained  to  use  as 
many  radial  basis  functions  as  data  points.  In  this  case  the  strict  interpolation  conditions 
generally  result  in  this  redundancy  being  used  to  fit  misleading  variations  due  to  imprecise, 
or  noisy  data. 

It  is  possible  to  avoid  this  difficulty  by  weakening  the  interpolation  conditions.  We 
suggest  the  following  generalisations  to  the  conventional  radial  basis  function  approach. 
First,  it  may  be  necessary  to  distinguish  between  the  data  points,  i  =  1,2, .  ..,m)  and 
the  radial  basis  function  centres,  (j/l ,  j  =  1,2,  ...,n0  no  <  m)  *.  The  problem  thus 
becomes  overspecified,  the  matrix  A  is  not  square,  a  unique  inverse  no  longer  exists,  and  the 
previous  exact  problem  becomes  one  of  linear  optimisation.  In  the  following,  we  shall  adopt 

'In  particular,  we  do  not  necesiarily  require  that  the  radial  basis  function  centres  correspond  to  any  of  the 
data  points. 
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a  minimum  norm  least  squares  method  by  introducing  the  Moore-Penrose  pseudo-inverse, 
A+  of  matrix  A.  For  the  case  where  rankA  =  no,  this  has  the  property  that  A+A=  1 
where  1  denotes  the  no  x  no  identity  matrix.  The  pseudo-inverse  provides  a  unique  solution 
to  the  linear  least  squares  problem  (10]  in  the  following  sense, 

Of  all  the  vectors  X  which  minimise  the  sum  of  squares  ||AA  -  /||2,  the  one  which  has 
the  smallest  norm  (and  hence  minimises  ||A||2y  *’«  given  by  X  =  A+ f. 

For  this  particular  solution  set,  an  expression  may  be  derived  for  the  normalised  error 
£ ,  specifically, 


where  (/)  is  the  mean  value  of  the  response  vector  over  all  the  training  data  points.  Note 
that  if  the  pseudo-inverse  equals  the  exact  inverse,  then  the  left  and  right  inverses  are  the 
same  and  hence  the  matrix  product  AA+  is  the  m-dimensional  identity  matrix,  and  this 
error  is  zero. 

An  additional  modification,  which  is  useful  particularly  when  /(i)  has  a  large  x-independent 
component,  is  to  incorporate  constant  offsets  {A0*}  into  the  form  of  the  interpolating  func¬ 
tions 

m 

«*U)  =  A0*+  A,t^(||£- j/,11)  i€Rn  k  =  1, 2, . . . ,  n'  (8) 

>=l 

These  coefficients  enter  the  least  squares  formalism  through  an  additional  column  in  A 

A,0  =  1  «  =  1,2,  .,m  (9) 

In  this  form,  the  radial  basis  function  approach  to  multi-variable  functional  interpolation 
has  a  close  resemblance  to  adaptive  network  theory.  This  will  be  discussed  in  the  following 
sections.  An  important  consequence  of  this  approach  which  should  be  emphasised  is  that 
the  determination  of  the  nonlinear  map  s(x)  has  been  reduced  to  a  problem  in  linear  algebra. 
The  coefficients  Ay*  appear  linearly  in  the  functional  form  of  the  mapping,  therefore  the 
problem  of  determining  the  precise  values,  even  in  an  overdetermined  situation,  has  been 
reduced  to  one  of  a  linear  least  squares  optimisation  which  has  a  ‘guaranteed  learning 
algorithm’  through  the  pseudo-inverse  technique  2.  Of  course,  this  .-eduction  has  assumed 
suitable  choices  for  the  centres,  {^}  and  the  function  d>.  It  may  be  argued,  therefore,  that 
the  restriction  of  the  optimisation  by  fixing  these  quantities  is  excessive  and  must  limit  the 
range  of  applicability  of  the  approach  In  the  case  of  the  strict  interpolation  this  does  not 
seem  to  be  the  case,  at  least  as  far  as  the  choice  of  <f>  is  concerned  1 1 5] .  There  is  evidence  to 
show  (17],  again  for  strict  interpolation,  that  the  efTect  of  a  suboptimal  choice  of  the  {^} 
is  to  reduce  the  rate  of  convergence  of  the  expansion  given  in  equation  (6).  For  the  least 
squares  extensions  described  here,  much  less  is  known. 

3  As  a  technical  numerical  point,  the  solution  will  not  generally  be  obtained  from  the  normal  equations 
(which  may  be  ill-conditioned),  but  would  be  obtained  via  the  efficient  procedure  of  singular  valued 
decomposition. 
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3  The  radial  basis  function  method 
viewed  as  a  layered  network 


Much  of  the  power  and  attraction  of  current  adaptive  network  models  is  contained  in  the 
nonlinear  aspects  of  the  hidden  units  so  that  nonlinear  input-output  relationships  may  be 
modelled.  Unfortunately,  since  the  error  criterion,  or  cost  function,  depends  on  the  response 
of  the  nonlinear  elements,  the  problem  of  finding  a  globally  optimum  solution  becomes  one 
of  unconstrained  nonlinear  least  squares  minimisation.  Such  problems  can  be  solved  usually 
only  by  iterative  techniques.  For  instance,  ‘error  back-propagation’  is  a  first  order  (only 
depends  on  the  gradient  of  the  function  of  interest)  steepest  descent  technique  which  suffers 
from  slow  convergence  properties  due  to  it’s  tendency  to  zig-zag  about  the  true  direction  to 
a  minimum.  More  sophisticated  iterative  schemes  derived  from  second  order  approximations 
have  been  proposed  which  improve  convergence  properties  (the  quasi-Newton,  or  variable 
metric  algorithms — see  for  instance  [11],  Chapter  15  and  [12]).  Recent  works  of  note  which 
have  considered  more  general  iterative  strategies  and  found  them  to  be  orders  of  magnitude 
superior  to  back-propagation  when  applied  to  specific  layered  network  problems  are  those  of 
Lapedes  and  Farber  [13]  (conjugate  gradients)  and  Watrous  [14]  ( Davidon-Fletcher-Powell 
and  Broyden-Fletc^er-Goldfarb-Shanno).  In  spite  of  this  effort,  the  difficulty  remains  that 
the  solution  obtained  by  such  methods  is  not  guaranteed  to  be  the  global  optimum  since 
local  minima  may  be  found.  There  is  no  reason  in  the  current  iterative  schemes  why  a 
least  squares  minimisation  solution  obtained,  will  necessarily  have  the  required  form  Even 
choosing  a  good  initial  starting  point  for  the  iteration  schemes  will  not  necessarily  imply 
that  a  good  approximation  to  the  global  minimum  will  be  obtained.  It  is  important  to 
know  how  ‘good’  a  solution  is  obtained  by  settling  for  a  local  minimum,  and  under  what 
conditions  the  solution  at  such  a  minimum  has  to  be  deemed  unsatisfactory. 

In  the  previous  section  it  was  shown  that  because  of  the  linear  dependence  on  the  weights 
in  the  radial  basis  function  expansion,  a  globally  optimum  least  squares  interpolation  of 
nonlinear  maps  can  be  achieved.  The  relevance  of  this  to  layered  network  models  is  that 
the  mapping  produced  by  the  radial  basis  function  expression  eqn.  (6),  has  the  form  of  a 
weighted  sum  over  nonlinear  functions.  There  is  thus  a  natural  correspondence  with  the 
following  general  3-layer  network  system,  in  which  the  layers  are  fully  interconnected  with 
adjacent  layers,  but  there  are  no  interconnections  within  the  layers  (see  figure  1). 

The  input  layer  of  this  network  model  is  a  set  of  r»-nodes  waiting  to  accept  the  com¬ 
ponents  of  the  n-dimensional  vector  z.  These  input  nodes  are  directly  connected  to  all  of 
the  hidden  nodes  Associated  with  each  connection  is  a  scalar  (y,;  for  the  link  between 
the  i-th  input  node  and  the  j-th  hidden  node)  such  that  the  fan-in  to  a  given  node  has 
the  form  of  a  hypersphere,  i.e.  in  the  case  of  a  Euclidean  norm,  the  input  to  the  node  is 
9j  —  £"_i(ii  -  y,j)J  where  the  x,  are  components  of  x.  The  ‘hidden  layer’  consists  of  a  set 
of  n o  nodes,  one  for  each  radial  basis  function  centre.  The  output  of  each  of  these  is  a  scalar, 
generally  nonlinear  function  of  0y  The  hidden  layer  is  fully  connected  to  an  output  layer 
corresponding  to  the  n'-components  of  the  n'-dimensional  response  vector  s(x)  of  the  net¬ 
work.  The  input  value  received  by  each  output  unit  is  a  weighted  sum  of  all  the  outputs  of 
the  hidden  units,  where  the  strengths  of  connections  from  the  j-th  hidden  unit  to  the  fc-th 
output  unit  are  denoted  by  A;*.  The  response  of  each  output  unit  is  a  linear  function  of  its 
net  input  which  may  include  the  bias  A o*.  A  natural  extension  is  to  allow  for  nonlinearity  in 
the  response  of  the  output  units.  Clearly,  if  the  transfer  function  of  these  units  is  invertible, 
then  it  can  be  accounted  for  by  a  suitable  modification  of  the  interpolation  conditions  used 
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Input  layer 


Hidden  layer 


Output  layer 


Figure  1:  A  schematic  diagram  of  the  feed-forward  layered  network  model  represented  by 
the  radial  basis  function  expansion. 
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to  derive  the  weights  {A;fc}.  This  specifies,  as  a  layered  network  model,  the  radial  basis 
function  expansion,  equation  (8),  which  produces  a  mapping  from  an  n-dimensional  input 
space  to  an  n'-dimensional  target  space. 


This  type  of  network  falls  within  the  general  class  of  nonlinear  layered  feedforward 
networks.  In  particular  we  have  specialised  the  geometry  to  a  single  hidden  layer  and  fixed 
the  fan-in  and  fan-out  to  the  hidden  units  as  indicated.  The  choice  of  hyperspherical  fan- 
in  to  the  hidden  units  has  the  effect  of  sectioning  the  decision  space  into  hyperspherical 
regions  rather  than  hyperplanes  which  result  from  the  more  usual  choice  of  a  scalar  product 
type  of  fan-in.  This  has  the  advantage  of  allowing  disjoint  regions  in  the  decision  space, 
but  which  pertain  to  the  same  classification,  to  be  satisfactorily  resolved  by  the  single 
‘hidden’  adaptive  layer.  Problems  without  simple  connectivity  would  traditionally  require 
two  hidden  adaptive  layers,  as  discussed  in  [  18] ,  whereas  the  approach  described  in  this 
paper  can  deal  with  such  probU  ns  by  employing  a  single  hidden  layer. 


In  principle,  the  total  set  of  adjustable  parameters  include  the  set  of  no-radial  basis 
function  centres,  y^.  as  well  as  the  set  of  (no  +  l)xn'  weights,  Ay*.  However,  only  the  latter 
are  included  in  the  least  squares  analysis  in  this  paper  in  order  to  preserve  the  linearity  of 
the  learning  procedure. 


The  radial  basis  function  strategy  may  be  applied  to  the  general  multi-layer  perceptron 
for  which  the  output  units  have  an  invertible  nonlinearity.  Moreover,  when  extended  to 
allow  for  variation  in  the  input-hidden  weights,  this  method  provides  an  interesting  pic¬ 
ture  of  learning.  In  particular,  the  hidden-output  weights  may  be  visualised  as  evolving 
on  a  different  ‘time  scale’  to  the  input-hidden  weights.  Thus,  as  the  input-hidden  weights 
evolve  slowly  by  some  nonlinear  optimisation  strategy,  the  hidden-output  weights  adjust 
themselves  rapidly  through  linear  optimisation  so  as  to  always  remain  in  the  global  mini¬ 
mum  of  an  evolving  error  surface  over  the  hidden-output  weights  which  is  parametrically 
controlled  by  the  input-hidden  weights. 


The  rest  of  this  paper  is  concerned  with  various  simple  applications  of  the  radial  basis 
function  network  assuming  a  fixed  set  of  centres.  In  the  absence  of  any  a  priori  knowledge, 
the  centres,  { y  )  are  either  distributed  uniformly  within  the  region  of  Rn  for  which  there 
is  data,  or  they  are  chosen  to  be  a  subset  of  the  training  points  by  analogy  with  strict 
interpolation.  We  expect  that  with  additional  knowledge  of  the  surface  to  be  fitted,  the 
freedom  to  position  the  centres  may  be  used  to  advantage  to  improve  the  convergence  of 
the  expansion  (although  not  necessarily  to  improve  the  ‘fit’  to  the  unseen  data).  Evidence 
for  this  as  well  as  insight  into  the  significance  of  the  centres  follows  from  the  work  of 
Powell  [15]  who  showed  that  for  strict  interpolation  when  n  =  n'  =  1  and  when  <f>[r)  = 
r2k+l,  (k  =  0,  1,. .  .),  the  radial  basis  function  method  is  equivalent  to  interpolation  with 
natural  splines.  In  this  case  the  {y^}  are  the  knots  of  the  spline  fit.  Naturally,  when  the 
strict  interpolation  is  weakened  to  give  a  least  squares  interpolation,  the  significance  of  the 
‘knots’  in  constraining  the  surface  is  also  weakened.  In  what  follows,  we  shall  attempt  to 
be  as  general  as  possible  in  the  analytic  work  by  assuming  an  arbitrary  form  of  4>.  Where 
numerical  work  necessitates  a  specific  choice,  we  have  chosen  to  employ  either  a  Gaussian 
form  ( 4>{r )  ss  expf-r2])  or  a  multiquadric  (^(r)  es  \/c2  +  r2). 
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4  Specific  example  (i): 

the  exclusive-OR  Problem  and  an  exact  solution. 


Input 

Pattern 

Number  of  ‘ON’ 
bits  in  input 

Output 

Pattern 

00  -* 

0  — ■ 

0 

01  — 

1  — 

1 

10  — 

1  — 

1 

11  — 

2  — ■ 

0 

Table  1:  Symbolic  mapping  of  the  exclusive-OR  problem. 


The  exclusive-OR  problem  defined  by  the  symbolic  mapping  in  Table  1  has  been  considered 
interesting  because  points  which  are  closest  (in  terms  of  the  Hamming  distance)  in  the  input 
space,  map  to  regions  which  are  maximally  apart  in  the  output  space.  This  is  a  classic 
example  of  a  logic  function  that  cannot  be  represented  by  a  linear  perceptron.  Clearly,  a 
function  which  interpolates  between  the  points  given  in  Table  1  must  oscillate,  and  may  be 
hard  to  represent  using  a  subset  of  data  points  unless  there  is  some  built  in  symmetry  in 
the  radial  basis  functions. 

In  what  follows,  we  initially  take  one  radial  basis  function  centre  determined  by  each 
piece  of  input  data,  so  that  both  j/  &  are  selections  of  the  four  ordered  input  patterns 
of  Table  1.  We  choose  to  number  the  four  possible  input  patterns  as  (0,0)  — ♦  1,  (0,  l)  — * 
2,  (1, 1)  —  3,  (1,0)  — »  4  which  we  visualise  to  be  the  cyclically  ordered  corners  of  a  square. 
The  action  of  the  network  may  be  represented  by 

*U)  = 

)=i 

so  that  the  set  {Ay}  may  be  found  from 

ft  =  ]£  MdlS*  -  Ifyll) 

>=» 

it. 

A  =  A"1/ 


For  the  ordering  we  have  chosen,  the  vector  £  and  the  matrix  A  take  the  specific  forms 


and 


V  1  / 


(  4>o 

<h  N 

4>  i 

4>  o 

4>i 

<Av/i 

^0 

4>i 

V  4>\ 

4>i 

4>o  J 
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where  a  labelling  notation  has  been  introduced  so  that  0 o  denotes  the  value  of  <p(||x,  —  x,|j), 
0 i  denotes  the  value  0(j|x,  -  x,±1||)  and  0^  represents  0(||x,  -  H),  all  counting  being 

performed  cyclically  around  the  square.  Note  that  this  is  a  labelling  device  and  we  will 
not  exploit  the  properties  of  any  particular  distance  function  at  this  stage  (although  the 
notation  0^/j  indicates  that  we  have  in  mind  a  Euclidean  metric  as  an  example). 

It  will  be  convenient  to  construct  A-1  from  the  eigenvalue  decomposition  (see  Ap¬ 
pendix  A  for  further  details) 


A  ee  V^Vr 

Since  A  is  real  symmetric,  V  is  an  orthogonal  matrix  with  columns  composed  out  of  the 
orthogonal  eigenvectors;  in  this  case, 


2 

and  p  is  the  real,  diagonal  matrix 

P  = 


/I 

1 

V2 

0\ 

1  -1 

0 

-y/2 

1 

1 

-v/2 

0 

Vi  -1 

0 

-v/2  / 

(PA 

0 

0 

0  \ 

0 

PB 

0 

0 

0 

0 

PE 

0 

\  0 

0 

0 

PE  ' 

(12) 


where 


Pa  =  (0  0  +  20i  +  0,/j) 

PB  =  (00  -  201  +  0yj)  (13) 

PE  =  (00  ~  0v/j) 


Note  that  at  this  stage  it  is  possible  to  decide  how  well  posed  the  original  problem  was, 
by  seeing  under  what  conditions  an  inverse  matrix  A-1  exists.  It  is  clear  from  the  form  of 
the  derived  eigenvalues  of  the  problem,  that  an  inverse  exists  as  long  as 

00^0^2  (14) 


or 


01  /  ± 


^  00  +  0y^  j 


(15) 


are  satisfied.  Thus  fai  the  analysis  has  been  carried  out  for  an  arbitrary  choice  of  non¬ 
linear  radial  basis  function  and  metric,  therefore  the  above  conditions  can  be  taken  to  be 
restrictions  on  the  various  combinations  of  radial  basis  function  and  metric  that  may  be 
reasonably  employed  for  this  problem.  It  is  interesting  to  point  out  two  situations  where 
an  inverse  does  not  exist. 


•  0(x)  =  mx  +  c  combined  with  a  city-block  metric 
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and  4>(x)  =  ax 2  +  c  combined  with  a  Euclidean  distance  function,  (||x||  =  xf) 


as  may  be  easily  verified  by  direct  substitution  into  equation  (15).  These  instances  corre¬ 
spond  to  the  network  structure  of  a  linear  perceptron  and  are  thus  unable  to  represent  the 
exclusive-OR  function. 

Given  the  inverse  of  A,  we  may  proceed  to  obtain  the  ‘weights’  of  the  network  as 

A  =  A  "V 


=  V^\Tl 


(16) 


=  V^\T  (vA  -  vB) 


where  we  have  exploited  the  fact  that  the  response  vector  eqn.  (10),  may  be  decomposed 
into  the  difference  of  the  basis  vectors  vA  and  ub  derived  in  the  appendix  (Appendix  A). 
Performing  the  above  operations  (which  is  simplified  since  vectors  orthogonal  to  vA,Vg  do 
not  contribute)  gives  the  final  result  that 


(17) 


(18) 


Equation  (17)  specifies  the  choice  of  network  weights  which  exactly  solves  the  exclusive- 
OR  problem.  The  weights  are  still  dependent  on  the  precise  choice  of  radial  basis  function 
and  distance  measure.  Clearly  we  are  free  to  choose  these  subject  to  condition  (15)  without 
affecting  the  solution  of  exclusive-OR.  This  choice  does  however  influence  the  output  of  the 
network  for  arbitrary ,  real-valued  inputs 

Figure  2  illustrates  the  solution  for  the  specific  choice  of  a  Euclidean  distance  function 
and  Gaussian  radial  basis  functions  (<f>{x)  =  exp[-x2/a]).  Similarly,  Figure  3  shows  the 
output  using  the  same  distance  function,  but  employing  multiquadric  radial  basis  functions 
(^(r)  =  \/\  +  x^/a).  In  both  instances,  the  mapping  surfaces  have  two  planes  of  reflection 
symmetry  through  the  diagonals  of  the  unit  square.  The  difference  is  that  the  Gaussian 
choice  produces  two  maxima  near  to  the  odd  parity  inputs  and  two  shallow  minima  close 
to  the  even  parity  inputs.  The  multiquadric  does  not  have  these  maxima  and  moreover 
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1.0000 


Figure  2:  The  exclusive-OR  solution  (i).  Obtained  using  a  Euclidean  distance  function  and 
Gaussian  radial  basis  functions  centred  at  the  corners  of  the  unit  square. 


Figure  3.  The  exclusive  OR  solution  (ii).  Obtained  using  a  Euclidean  distance  function 
and  multiquadric  radial  basis  functions  centred  at  the  corners  of  the  unit  square. 
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diverges  rapidly  outside  the  unit  square.  This  distinction  is  of  no  relevance  to  the  exclusive- 
OR  function  itself.  It  would  however,  be  significant  were  it  attempted  to  give  meaning 
to  input  and  output  values  other  than  those  represented  by  zero  and  unity.  Clearly,  the 
details  of  the  generalisation  would  then  be  dependent  on  the  specific  interpolation  scheme 
represented  by  the  network. 

We  conclude  this  section  with  a  discussion  on  the  number  of  ‘hidden  units’  employed 
to  solve  the  problem.  Note  that  the  problem  has  been  solved  exactly;  given  the  weights 
as  determined  by  eqn.  (18)  and  a  specific  choice  of  a  radial  basis  function,  applying  any  of 
the  input  pattern  pairs  will  guarantee  to  get  the  correct  output  answer.  On  preliminary 
inspection  this  may  not  seem  so  surprising  since  each  possible  input  data  point  was  used  as 
a  centre  for  a  radial  basis  function  and  so  a  ‘dictionary’  of  possibilities  could  be  encoded.3 

One  can  exploit  the  symmetry  of  the  solution  however,  to  show  how  it  is  still  possible 
to  solve  the  exclusive-OR  problem  exactly  without  explicitly  specifying  the  response  of 
the  whole  set  of  input  states.  Specifically,  from  eqn.  (18)  and  by  a  judicious  or  ‘fortuitous’ 
choice  of  nonlinear  function  <j>  (for  instance  if  <f>\  =  0  or  4>o  —  —  then  two  of  the  four 
possible  weights  would  be  zero.  This  uncouples  the  corresponding  pair  of  hidden  units  from 
the  system,  with  the  result  that  the  remaining  network  satisfies  the  exclusive-OR  function 
without  being  explicitly  ‘trained’  on  the  entire  possible  set  of  input/output  pairs. 


1 


Figure  4:  Equivalent  network  for  ex-  Figure  5:  Equivalent  network  for  ex¬ 
clusive-OR  with  A2  set  to  zero.  clusive-OR  with  Aj  set  to  zero. 

For  the  case  that  <f>\  =  0  (Figure  5)  the  two  identical  weights  connecting  the  two  hidden 
units  to  the  output  unit  have  a  value  of  l/[0o  +  ^^1-  this  case>  the  hidden  units  centred 
on  the  patterns  (0, 0),  (1, 1)  have  no  connections  to  the  output  and  hence  cannot  contribute 
Thus,  when  these  patterns  are  presented  to  the  network,  the  two  units  which  would  react 
most  strongly  to  their  presence  have  been  disconnected  from  the  output  unit  while  the 
remaining  two  respond  with  <f>\  —  0  as  expected  Alternatively,  if  the  patterns  (0, 1),  (1,0) 
are  presented,  one  hidden  unit  contributes  a  value  of  ^0  and  the  other  a  value  of  <t> v/j.  Since 
their  sum  is  just  l/Aj  the  result  of  the  network  is  to  give  the  answer  1  as  it  should.  A 
similar  argument  may  be  presented  for  the  case  when  ^0  =  -^v/j. 

In  either  case,  the  surfaces  shown  in  Figure  2  and  Figure  3  are  constrained  sufficiently 
for  the  specification  of  just  two  points  to  fix  the  remaining  pair  of  output  values.  Here 

’However,  note  that  this  scheme  has  achieved  a  fit  with  four  adjustable  parameter*,  the  weight*  X,,  whereas 
the  standard  2-2-1  multi-layer  perceptron  would  employ  nine  adjustable  parameters. 
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we  have,  by  a  suitable  choice  of  <fr,  adjusted  the  form  of  the  fitting  algorithm  to  admit  a 
network  which  ‘generalises’  the  incomplete  training  data  to  give  the  entire  exclusive-OR 
function.  This  sort  of  procedure  can  of  course  be  employed  by  a  multi-layer  perceptron 
However,  it  should  be  clear  that  for  a  strongly  folded  surface  such  as  represents  the 
exclusive-OR  function  (or  the  more  general  n-bit  parity  function  shown  in  Appendix  B)  the 
presence  or  absence  of  redundant  points  which  may  be  omitted  from  the  training  set  must 
depend  sensitively  on  the  implicit  fitting  procedure  employed  by  the  network.  Moreover, 
the  question  of  which  points  are  redundant  must  also  require  detailed  specific  knowledge  of 
the  network  and  the  relationship  it  is  being  used  to  represent.  As  a  rule,  one  can  expect 
a  network  to  be  capable  of  ‘correctly’  generalising,  only  when  there  is  sufficient  training 
data  appropriately  distributed  to  enable  an  adequate  fit  to  significant  turning  points  of  the 
underlying  graph.  Clearly,  the  more  folded  this  graph  is,  the  more  demanding  will  be  the 
requirment  on  the  data. 


5  An  analytic  solution  to  a  non-exact  problem: 
The  exclusive-OR  problem  with  two  centres. 


The  previous  section,  with  its  related  appendices,  dealt  with  exact  solutions  to  strict  inter¬ 
polation  problems.  For  strict  interpolation  the  interpolating  surface  is  constrained  to  pass 
through  all  the  training  data  points.  This  is  achieved  by  using  the  formalism  described  in 
the  first  part  of  section  2.  which  requires  the  use  of  a  radial  basis  function  (or  hidden  unit) 
for  each  distinguishable  data  pair.  It  was  noted  that  this  rule  may  be  relaxed  in  special 
circumstances  where  the  symmetry  and  other  details  of  the  problem  may  be  employed. 

In  this  section  we  shall  consider  a  specific  example  of  the  more  general  approach  dis¬ 
cussed  at  the  end  of  section  2  which  relaxes  the  strict  interpolation  of  the  data.  In  addition, 
recall  that  in  section  2,  sufficient  scope  was  allowed  in  the  variant  of  radial  basis  function 
techniques  to  accomodate  an  approximate  interpolating  surface  whereby  this  surface  is 
not  directly  constrained  to  go  through  all  (or  any)  of  the  training  set.  This  is  clearly  an 
advantageous  strategy  when  the  input  data  is  corrupted  by  external  sources  and  it  would 
not  be  desirable  to  try  and  fit  the  noise  added  to  the  (presumably)  structured  data.  In 
addition,  where  the  true  data  actually  represents  a  smooth  map,  it  allows  the  use  of  far 
fewer  hidden  units  than  data  points. 

We  repeat  the  analysis  of  the  exclusive-OR  problem  considered  in  the  previous  section, 
but  using  just  two  radial  basis  function  centres.  It  is  clear  that  there  are  two  distinct  choices 
how  the  two  centres  may  be  positioned:  either  on  opposing  vertices,  or  adjacent  vertices 
on  the  ordered  corners  of  the  unit  square.  We  choose  the  locations  of  the  centres  to  be  on 
opposing  vertices  at  (0,0)  and  (1, 1).  This  choice  allows  us  to  exploit  the  symmetry  of  the 
exclusive-OR  function  to  allow  its  solution  with  fewer  ‘training’  points  than  data  points. 

The  total  training  data  is  the  mapping  depicted  in  Table  1.  The  calculations  are  per¬ 
formed  using  the  pseudo-inverse  technique  with,  and  without,  the  use  of  an  adjustable  bias 
on  the  output  unit. 


5.1  The  approximate  exclusive  -OR  without  an  output  bias. 
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Following  section  2,  we  use  the  following  form  of  interpolating  functions: 

«(*)  =  £  xMWz-  Ifyll)  (19) 

>  =  1.3 

where  =  [0, 0]r,  ^  =  [1,  ljr.  The  set  {A,}  is  given  by 

A=  £  ‘  =  1.2, 3, 4  (20) 

>=1.3 

so  that 

A  =  A+/  (21) 

where  /  is  the  same  response  vector  as  in  the  exact  case,  eqn.  (10),  and  A+  is  the  pseudo¬ 

inverse  of  the  (non  square)  transformation  matrix 

(  <t> 0  <t>2  ^ 

A  =  1'  1'  l22) 

<P  2  <P0 

<4>i  4>i  J 

From  Appendix  A,  given  the  singular  value  decomposition  of  A 

A  =  USVT 


the  pseudo-inverse  is  obtained  as 


A+  =  V(S”l)UT 
=  V(S-‘)?VrAT 


The  matrix  V  is  composed  of  the  normalised  eigenvectors  of  the  matrix 


■c :) 


where 


a  -  <Po  +  2<t>i  +  <t> 2 
6  =  2<fr\  +  2<fio4>2 

and  the  diagonal  matrix  (S-1)2  is  made  up  of  the  reciprocal  of  the  eigenvalues  of  the 
corresponding  eigenvectors. 

It  is  straightforward  to  verify  that 


-KM) 

_  (  l/la  +  *1  0 

(  ’  ~  v  o  l./fa  -  6] 


Substituting  these  matrices  into  the  pseudo-inverse  expression,  eqn.  (23)  and  then  into 
eqn.  (21)  gives, 

*=(};)  <2e> 
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where 

Ai  =  2<t>i/[a  +  b\ 

=  2<h  (29) 

4<p\  +  (^>0  +  &!* 

This  is  the  set  of  weights  with  minimum  norm  which  minimises  the  least  mean  squared 
error  to  all  the  training  data.  In  fact  the  error,  t ,  may  be  evaluated  analytically  to  give 

^  _ _ (4>o  +  <M _ 

yjl4>\  +  (&)  +  ^j]S/2 

An  interesting  point  about  this  error,  is  that  it  will  be  zero  if  the  radial  basis  functions 
are  chosen  to  be  such  that  =  ~4> 2-  This  is  precisely  the  condition  mentioned  at  the  end 
of  the  previous  section  in  the  discussion  of  how  the  exact  exclusive-OR  problem  could  be 
solved  with  only  two  radial  basis  function  centres.  In  both  instances  the  error  is  zero  and 
the  interpolating  map  manages  to  perfectly  ‘learn’  the  training  data.  However,  for  general 
choices  of  radial  basis  function  the  solution  as  derived,  does  not  reproduce  the  desired  output 
of  exclusive-OR  satisfactorily.  For  instance,  Figure  6  depicts  this  solution  using  Gaussian 
radial  basis  functions  and  a  Euclidean  norm.  The  figure  plots  contours  of  equal  ‘height’ 
produced  at  the  output  of  the  radial  basis  function  mapping.  The  vertices  of  the  unit 
square  in  the  figure  represent  the  logical  values  00,  01,  11,  10.  As  seen,  although  the  output 
discriminates  succesfully  between  even  and  odd  parity  inputs,  the  relative  magnitudes  have 
not  been  preserved  (the  output  of  00  is  greater  than  the  output  of  01  for  instance).  This 
situation  is  rectified  by  the  inclusion  of  a  ‘bias’  attached  to  the  output  node  as  is  now 
demonstrated. 


(30) 


5.2  The  approximate  exclusive-OR  including  an  output  bias. 

Consider  the  same  situation  as  in  the  previous  subsection,  but  where  now  a  data  independent 
variable  is  allowed,  effectively  a  weighted  bias  at  the  output  node  through  a  connection  to 
a  node  which  gives  a  constant  unit  output.  The  interpolating  function  now  has  the  form 


s(i)  =  Ao+  £  A;>(||i  -  g;.||)  (31) 

>=1,3 

where  { Aj  3}  are  as  previously  assumed.  The  problem  is  to  fit  three  parameters  by  using 
the  same  four  training  points 


The  matrix  of  distances  is  now 


/I 

<00 

4>7  y 

1 

<t>  1 

4>  1 

1 

<t>  2 

^0 

Vi 

<t>  1 

4>i 

(32) 


Consequently,  the  singular  value  decomposition  is  determined 
eigenvalues  of  the  matrix 


aTa=(:  :  i) 


by  the  eigenvectors  and 


(33) 
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Figure  7:  Contours  of  the  approximate  exelusive-OR  solution,  including  an  output  bias. 


where 

Q  =  <f>  o  +  2<t>]  +  4>\ 

b  =  2<f>\  +  2<f>i<f>2  (34) 

c  =  <i>0  +  24>l  +  $7 

Note  that  c  here  has  the  interpretation  of  being  proportional  to  the  mean  value  of  radial 
basis  functions  evaluated  at  any  training  point. 


Consider  the  eigenvalue  equation 


A'A  Uj  =  /i  Uj 


From  the  characteristic  equation  one  finds  that  the  eigenvalue  problem  factorises,  giving 
Po  =  a  -  6 

(a  +  6  +  4)  ±  v''(a  +  b  -  4)J  +  8c2  (^5) 

~  2 


The  normalised  eigenvector  corresponding  to  p  =  a  -  b  is  then 


—  v5U 
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For  the  case  n  =  /i±,  we  have  02  =  <*s,  since  the  resulting  eigenvectors  are  orthogonal 
to  do  Setting  02  =  1  without  loss  of  generality  implies  that  the  (unnormalised)  component 
£*!  i9 


a?  = 


2c 


M±  - 


Thus  the  resulting  orthonormal  matrices  V  and  (S  *)*  take  the  form 


V  = 


(  l/v/2 

V-1/V5 


Qi  /A  + 

1/A+ 

1/A- 

(37) 

1/A+ 

1/A_  } 

Mo  0 

0  ^ 

0  /!  + 

° 

(38) 

0  0 

M-  1 

where  the  normalisation  factors  A±  are  given  explicitly  by 


(A±)’  =  2+(Qf)J 


(39) 


Using  these  matrices  to  construct  the  pseudo-inverse  of  A  results  finally  in  the  set  of 
weights, 

(  Xo\ 


where 


A  =  M. 

VAj 


Ao  —  - 7t(qi^  +  2<Ai)  +  7T(Qi  +  2<^i) 

h+a;  ^i-Ai 


Ai  =  +  20i)  +  ^  ^2  (ai  +  2<M 


(40) 


(41) 


This  result  is  shown  in  Figure  7  and  may  be  compared  directly  with  that  of  the  previous 
subsection  shown  in  Figure  6.  Note  that  the  network  now  succesfully  discriminates  states 
of  opposite  parity  and  moreover  returns  precisely  the  correct  magnitude  for  each  corner  of 
the  unit  square.  However,  the  symmetry  of  placing  the  centres  on  the  diagonal  of  the  unit 
square,  means  that  the  solution  obtained  in  this  case  is  exact.  There  are  only  three  inde¬ 
pendent  equations  we  need  to  solve,  and  three  adjustable  parameters  at  our  disposal.  If  we 
had  chosen  our  centres  to  be  adjacent  vertices  of  the  unit  square,  then  the  symmetry  would 
not  have  existed  to  reduce  the  system  of  four  equations  to  just  three,  and  the  approximate 
analysis  would  not  have  produced  a  zero-error  mapping. 

We  choose  to  interpret  the  action  of  the  output  bias  in  the  following  way.  The  role  of 
the  output  bias  rests  on  the  fact  that  the  desired  output  states  of  the  exclusive-OR  function 
have  non-zero  mean.  The  analysis  without  the  inclusion  of  bias  achieves  a  minimum  error 
solution  which  matches  the  output  tn  the  mean.  However,  since  positive  weights  are  needed 
to  achieve  this,  the  resulting  s(i)  naturally  has  maxima  near  to  the  centres  (0,0),  (0, 1). 
Therefore,  s(i)  does  not  reproduce  the  required  qualitative  details  of  the  exclusive-OR 
function.  In  contrast,  the  inclusion  of  the  bias  allows  the  whole  surface  s(i)  to  be  ‘floated’ 
to  the  correct  mean  level  while  the  remaining  parameters  adjust  its  qualitative  form. 
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It  is  interesting  to  repeat  the  calculation  of  section  5.1  using  a  response  vector  /  = 
(—  j,  - 1, 1)T.  This  allows  one  to  study  an  equivalent  problem  with  the  mean  artificially 
removed.  This  produces,  again,  two  equal  weight  values  (compare  with  (29)) 

_  <Ai  -  [to  -1-  to]/2 
1  8^  +  2[&)  +  to]7 

The  resultant  fit  does  not  reproduce  the  training  values  as  well  as  the  three  parameter 
model  given  above.  It  does  however  have  the  correct  qualitative  form.  The  extra  bias 
parameter  provides  a  compensation  for  a  global  shift  which  is  hard  to  achieve  through 
weighting  the  individual  basis  functions.  The  consequences  of  this  observation  may  also  be 
noted  in  conventional  multi-layer  perceptron  studies  where  the  performance  of  the  network 
is  enhanced  if  the  input  data  is  previously  scaled  to  have  zero  mean. 


6  Numerical  examples: 

the  prediction  of  chaotic  time  series. 


Lapedes  and  Farber  [13]  have  recently  used  multi-layer  perceptrons  for  the  prediction  of  time 
series  data  generated  by  nonlinear  dynamical  systems.  Extensive  comparisons  with  other 
prediction  techniques  showed  that  multi-layer  perceptrons  were  more  accurate  than  the 
classical  (linear)  methods  and  were  comparable  with  the  locally  linear  approach  of  Farmer 
and  Sidorowich  [16],  In  this  section,  we  shall,  following  Lapedes  and  Farber,  use  nonlinear 
prediction  as  a  non-trivial  example  of  the  application  of  adaptive  networks.  We  note  that 
in  this  application  our  approach  is  very  close  to  that  of  Casdagli  [17]  who  has  applied 
radial  basis  functions  to  the  construction  of  nonlinear  maps  from  time  series  data.  Unlike 
Casdagli  who  used  strict  interpolation,  we  shall  employ  the  least  squares  generalisation 
given  in  section  2. 

Specifically,  consider  Tj,  an  ordered  sequence  of  iterates  of  the  doubling  map: 

xn+ 1  =  2r„  (modulo  1)  (42) 

and  Tq,  a  sequence  of  iterates  of  the  quadratic  map 

xn+i  =  4xn(l  -  x„)  (43) 

These  maps  are  known  to  be  chaotic  on  the  interval  [0, 1]:  in  both  cases  the  iterates  of 
generic  initial  conditions  are  distributed  according  to  continuous  invariant  measures.  For 
Ti  the  autocorrelation  (x0xn)  decays  as  2~n  while  for  T~,  ( xox„ )  as  So.n  where  StJ  is  the 
Kroenecker  delta.  Therefore,  given  only  the  data  Tq,  second  order  statistics  would  convey 
the  impression  that  the  sequence  is  random  broadband  noise  (see  Appendix  C  for  further 
details).  Naively  (and  in  fact,  erroneously)  one  might  expect  from  this  that  the  prediction 
of  Tq  is  harder  than  the  prediction  of  T(<. 

A  radial  basis  function  network  for  predicting  one  time  step  into  the  future  was  con¬ 
structed  as  follows.  The  basis  function  centres  {y;}  were  chosen  to  be  uniformly  spaced 
on  (0,1)  -  the  number  of  centres  was  an  adjustable  parameter.  A  set  of  input  values 
(x,  €  [0,  l]|i  =  1,2,..., 250}  was  used  to  calculate  the  matrix  A  using  equation  (4).  The 
singular  value  decomposition  of  A,  calculated  numerically  by  a  Golub-Reinsch  algorithm, 
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Figure  8:  The  Quadratic  Map:  A  figure  showing  the  actual  (solid  line),  and  predicted  (filled 
squares)  outputs  of  the  network  over  the  interval  [0,  lj  for  one  iterate. 


was  used  to  form  the  pseudo  inverse  A+  using  equation  (46).  This  was  then  applied  to  the 
vector  of  outputs: 

Z=(/(x.),...,/(x,),...,/(x2S0))T 

(where  f(x)  is  the  map  given  by  either  equation  (42),  or  equation  (43))  to  obtain  the  weights 
{Aj}  according  to  A  =  A +/  The  accuracy  of  this  mapping  were  then  analysed  for  an  extra 
250  different  ‘test’  points. 

Figures  8,  and  9  which  show  the  output  of  the  networks  as  a  function  of  inputs,  illustrate 
the  relationship  with  curve  fitting  for  these  simple  one-dimensional  problems.  It  is  clear 
that  the  basis  of  the  difference  between  predicting  T4  and  predicting  Tq  is  that  the  doubling 
map  is  discontinuous  and  therefore  hard  to  fit.  Multi-layer  perceptrons  also  have  difficulty 
with  trying  to  find  an  appropriate  set  of  weight  values  which  allows  a  good  fit  to  T<j  (in  fact 
the  overwhelming  tendency  is  for  the  multi-layer  perceptron  to  get  stck  in  an  unsuitable 
local  minimum,  M.D.  Bedworth,  private  communication). 


For  prediction  further  into  the  future,  the  situation  is  further  exacerbated  and  rapidly 
becomes  hard  even  in  the  case  of  Tq.  The  problem  is  now  one  of  fitting  a  graph  of  the 
n-th  order  iterate  of  equation  (42)  or  (43).  In  either  case  the  graph  has  2n_1  oscillations 
of  unit  amplitude.  In  terms  of  the  radial  basis  function  network,  this  would  require  at  least 
2”  hidden  units  with  centres  appropriately  positioned.  An  alternative  to  this  strategy  is  to 
iterate  the  one-step  network.  This  however,  is  inaccurate  since  errors  in  chaotic  systems 
grow  exponentially  because  of  the  local  instability  of  the  evolution. 

The  accuracy  of  the  network  can  be  quantified  by  the  following  index,  J: 


I  = 


(\xpredicttd(t)  Xttpte(trf(t)P) 

<(*-<*))*> 


(44) 
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Figure  9:  The  Doubling  Map:  A  figure  showing  the  actual  (solid  line),  and  predicted  (filled 
squares)  outputs  of  the  network  over  the  interval  [0, 1]  for  one  time  step  into  the  future  for 
the  doubling  map. 


This  quantity  is  a  measure  of  how  well  the  network  generalises  beyond  the  training 
data.  The  error  expression  given  in  section  2  has  the  same  form,  but  since  it  is  based  on 
the  training  data,  shows  how  well  the  network  reproduces  the  training  data  It  is  of  interest 
to  compare  the  two  since  the  difference  quantifies  the  degredation  of  the  predictive  power 
of  the  network  when  it  is  required  to  generalise.  The  graphs  shown  in  Figures  10  and  11 
summarise  both  kinds  of  error  analysis  for  networks  trained  on  Tj,  and  Tq. 

The  most  obvious  difference  between  these  figures  is  the  scale.  It  is  clear  that  prediction 
of  Tq,  whichever  error  criterion  is  used,  is  much  easier  than  the  prediction  of  Tj  by  several 
orders  of  magnitude.  Beyond  this,  we  see  in  both  cases  that  the  training  error  of  section  2  has 
the  same  basic  dependence  on  the  number  of  hidden  units;  that  is,  a  fast  improvement  as  no 
increases  to  about  30  followed  by  a  plateau  region  where  the  relative  improvement  is  small. 
As  no  approaches  the  number  of  data  points  used  in  the  training  (250  in  this  example), 
the  training  error  again  drops  rapidly  as  the  problem  approaches  the  strict  interpolaton 
limit.  This  drop  is  not,  however,  mirrored  by  a  drop  in  the  recognition  error.  Although 
initially,  the  recognition  error  follows  the  training  error  very  closely,  a  saturation  plateau  is 
reached  and  approximately  maintained  irrespective  of  how  many  hidden  units  are  employed. 
This  can  be  understood  since  the  capability  of  the  model  to  generalise,  is  connected  with 
the  underlying  ‘smoothness’  of  the  true  map  and  the  level  of  ‘smoothness’  built  into  the 
model  through  the  choice  of  metric  and  radial  basis  function  (and  indeed  the  assumption 
that  an  arbitrary  function  may  be  approximately  represented  by  the  radial  basis  function 
expansion).  Therefore  one  can  surmise  that  in  most  instances,  there  will  be  a  limiting 
accuracy  to  which  it  is  possible  to  model  unseen  data  generated  by  a  mapping.  This  is 
not  true  for  the  training  points  themselves,  since  it  is  possible  by  strict  interpolation  to 
produce  a  mapping  surface  which  exactly  passes  through  all  the  points.  However,  all  that 
this  accomplishes  is  a  fit  to  the  noise  on  the  training  points  which  may  oscillate  wildly 
between  the  constraining  ‘knots’.  It  was  for  this  very  reason  that  we  introduced  the  least 
squares  solution  of  the  radial  basis  function  construction  in  section  2. 
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Figure  10:  The  Quadratic  Map:  The  log  normalised  error  showing  the  training  (solid  circles) 
and  recognition  data  (open  circles)  as  a  function  of  the  number  of  radial  basis  function 
centres.  Euclidean  norm  and  a  Gaussian  radial  basis  function  {4>  —  exp[- rzrio/16])  were 
used. 
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Figure  11:  The  Doubling  Map:  The  log  normalised  error  showing  the  training  (solid  circles) 
and  recognition  data  (open  circles)  as  a  function  of  the  number  of  radial  basis  function 
centres.  Euclidean  norm  and  a  Gaussian  radial  basis  function  ( <f>  =  exp(-z2rio/ 16])  were 
used. 
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7  Conclusion 


The  object  of  this  paper  has  been  to  introduce  a  simple  view  of  network  models  as  devices 
for  the  interpolation  of  data  in  multidimensional  spaces.  The  purpose  of  this  is  to  allow  the 
application  of  a  large  body  of  intuition  and  knowledge  from  the  theory  of  fitting  and  interpo¬ 
lation  to  the  understanding  of  the  properties  of  nonlinear  networks.  Thus  we  associate  the 
concept  of  generalisation  in  networks  with  the  simple  idea  of  interpolation  (extrapolation) 
between  known  data  points.  The  details  of  the  generalisation  are  then  dependent  upon  the 
implicit  interpolation  scheme  employed  by  a  given  network.  Generalisation  is  hard  where 
the  relationship  has  strong  oscillations  or  discontinuities.  This  suggests  that,  particularly 
in  the  case  of  abstract  problems  for  which  the  topology  of  the  input  and  output  spaces  may 
not  be  clear  a  priori,  it  may  be  advantageous  to  attempt  to  code  the  data  so  as  to  produce 
a  relationship  which  is  as  smooth  as  possible.  Further  we  expect  the  training  data,  where 
possible,  would  best  be  distributed  to  give  information  about  all  the  turning  points  of  the 
graph  and  need  not  be  tightly  clustered  where,  for  example,  the  relationship  is  smooth  or 
monotone. 


Motivated  by  this  philosophy,  we  introduce  a  network  model  based  on  the  radial  basis 
function  approach  to  curve  fitting.  This  model  has  two  main  advantages.  First,  it  is 
firmly  attached  to  a  well  established  technique  for  fitting,  but,  since  it  is  contained  within 
a  general  class  of  nonlinear  networks,  it  may  be  used  as  a  source  of  ‘existence  proofs’  for 
such  networks.  For  instance,  we  know  that  networks  of  this  form  can  be  used  to  model 
relationships  which  lie  in  the  function  space  spanned  by  the  chosen  set  of  radial  basis 
functions.  The  characterisation  of  this  space  and  quantification  of  such  things  as  the  rate 
of  convergence  of  radial  basis  function  expansions  is  currently  receiving  much  attention  and 
is  seen  to  be  of  direct  relevance  to  the  theory  of  networks. 


The  second  advantage  of  this  network  is  in  practical  application.  The  basis  of  its  sim¬ 
plicity  is  that  it  combines  a  linear  dependence  on  the  variable  weights  with  an  ability  to 
model  explicitly  nonlinear  relationships  such  as  for  example,  the  exclusive-OR  function 
Thus,  in  the  least  squares  context,  training  the  network  is  equivalent  to  solving  a  linear 
matrix  equation.  If  we  specialise  to  a  minimum  norm  solution,  the  solution  is  unique  and 
in  this  sense  the  network  may  be  said  to  have  a  guaranteed  learning  algorithm. 

This  general  approach,  whereby  optimisation  is  carried  out  on  the  subset  of  the  weights 
for  which  the  problem  is  linear,  may  be  taken  with  other  network  models.  It  would  be 
interesting  to  study  how  much  this  restricts  their  generality.  Work  along  these  lines  is 
currently  in  progress.  In  the  present  case,  on  the  other  hand,  the  inclusion  of  the  basis 
function  centres  into  the  optimisation  calculation  may  be  carried  out  using  a  nonlinear 
optimisation  technique  in  conjunction  with  the  linear  analysis  described  above.  By  analogy 
with  spline  fitting  of  curves,  this  may  produce  some  advantage,  perhaps  in  the  form  of 
needing  fewer  hidden  units,  but  it  is  questionable  whether  this  would  compensate  for  the 
added  complexity  of  performing  the  nonlinear  optimisation.  We  have  not  approached  here 
the  general  question  of  what  form  of  <f>  is  best,  or  where  and  how  many  centres  should  be 
used  in  the  expansion.  Work  is  currently  in  progress  to  assess  the  sensitivity  of  convergence 
of  these  factors  and  the  use  of  the  error  function  given  in  equation  (7)  as  a  cost  function 
for  nonlinear  optimisation  using  the  basis  function  centres. 
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A  Solving  linear  inverse  problems. 


The  appendix  looks  at  linear  inverse  problems  as  they  arise  in  the  radial  basis  function 
approach  to  nonlinear  networks. 

In  applying  the  radial  basis  function  method,  we  need  to  invert  an  m  x  n  (m  >  n) 
matrix  with  elements  AtJ  =  4>( \\x,  -  Since  A  may  be  rank  deficient,  it  is  necessary  to 
account  for  the  possibility  of  ill-conditioning.  Consider  the  singular  value  decomposition  of 

A 

A  =  USVT  (45) 

where  V  is  an  n  x  n  orthogonal  matrix,  S  is  an  n  x  n  diagonal  matrix  of  singular  values 
and  U  is  an  m  x  n  marix  with  orthonoTnal  columns.  The  pseudo  inverse  of  A,  A4,  may 
be  constructed  as 

A4=VS+Ut  (46) 

where  S+  is  obtained  from  S  by  reciprocating  the  non-zero  diagonal  elements.  Clearly 
if  rank  A-  m  then  A4  A  =  1  where  1  is  the  m  x  m  unit  matrix.  On  the  other  hand, 

AAJ  =  UUr  a  superposition  of  projections  onto  a  subspace  of  Rn  spanned  by  the  columns 

of  U.  If  rankA<  m  then  A4A  and  AA4  give  projections  onto  subspaces  of  F.m  and  Rn 
respectively 

In  the  case  of  the  exact  exclusive-OR  function  the  question  of  ill-conditioning  does 
not  arise  In  this  case  it  is  convenient  to  calculate  the  inverse  of  A  through  its  eigenvalue 
decomposition  since  the  symmetry  of  the  problem  may  be  exploited  to  obtain  the  solution 
Here 

A  =  V^Vr  (47) 

where,  since  in  this  case  A  is  a  real  symmetric  matrix,  the  matrix  of  eigenvectors  V  is 
orthogonal  It  follows  that 

A-1  =  V^-1VT  (48) 

assuming  that  A  is  full  rank.  The  rest  of  the  appendix  deals  with  the  calculation  of  the 
eigenvectors  and  eigenvalues  of  A  using  the  symmetry  of  the  exclusive-OR  function. 

Our  choice  of  ordering  of  the  input  points  in  section  4  is  somewhat  arbitrary.  It  should 
be  clear  that  we  can  perform  a  sequence  of  rotations  on  the  original  orderings  while  retaining 
the  same  matrix  A.  In  other  words,  A  is  invariant  to  a  certain  class  of  transformations,  in 
particular,  it  is  invariant  to  operations  in  the  group  C«  =  {E,  C4,  Cj,  C®},  where  E  is  the 
identity  transformation,  C<  denotes  rotations  by  n/2,  Cj  by  x  and  C4  rotations  by  3n/2. 
The  character  table  for  the  group  C«  is  shown  in  Table  2  (for  an  introduction  to  the  theory 
and  application  of  groups  see  [19]). 

The  character  table  may  be  exploited  to  solve  the  eigenvalue  problem  for  A  by  obtaining 
a  symmetry  adapted  basis.  We  can  see  this  as  follows.  The  representation  of  the  group 
operations  using  the  standard  basis. 
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C4 

E 

CA 

c2 

A 

1 

1 

1 

1 

B 

E 

1 

{: 

-1 

I 

—I 

I 

-1 

-1 

-I 

7} 

Table  2.  The  character  table  for  C4 

is  not  irreducible  since  it’s  dimension  is  four  and  the  maximum  dimension  of  an  irreducible 
representation  of  C4  is  two.  In  fact  this  representation  of  the  group  has  the  form  T  = 
A  +  B  +  E  and  so  the  basis  has  irreducible  components  A,  B,  E.  From  the  character  table 
,  one  can  ascertain  that  the  appropriate  symmetry  adapted  basis  is  just 


(!] 

(  M 

-1 

(  M 

i 

(  l\ 
-i 

Ha  = 

oJ 

UB  = 

t-jj 

V£  = 

-1 

l  -«  J 

v'e  - 

-1 

{  i) 

or,  by  normalising  and  replacing  the  degenerate  vectors  vE  and  v"E  by  simple  linear  combi¬ 
nations,  we  arrive  at  a  symmetry  adapted  set  of  basis  vectors, 


(l) 

(  M 
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(  °\ 

1 

1 
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-i 
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2  ^ 
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~A  ~  2 
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-v/2 

Vp  =  - 
-E  2 
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v-iy 

l  o) 

[-V2J 

It  is  clear  that  these  basis  vectors  are  orthogonal  and  they  are  eigenvectors  of  the  matrix 


A  since,  explicitly, 

A  vA 

=  (0o  +  20i  +  0^/5)^  => 

Ha  =  (0o  +  20i  +  0^) 

Avb 

-  (0o  -  201  +  0,^2 )vB  => 

HB  =  (00  -  20i  4-  0^) 

(50) 

Avg 

=  (00  -  => 

HE  =  (00  ~  0^) 

These  basis  vectors  and  eigenvalues  are  employed  in  section  4  to  obtain  a  set  of  weights 
analytically,  which  exactly  solves  the  exclusive-OR  function. 
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B  An  analytic  solution  of  the  exact  n-bit  parity  problem 


The  n-bit  parity  problem  is  a  generalisation  of  the  exclusive-OR  problem  discussed  in 
section  4.  It  may  be  defined  by  the  mapping  depicted  in  Table  3,  that  is,  the  output  is 
unity  if  the  total  number  of  input  bits  having  value  one  is  odd,  otherwise  the  output  is  zero. 
Thus  changing  one  bit  in  the  input  pattern  produces  a  maximal  change  in  the  output 


Input 

Pattern 

Number  of  ‘ON’ 
bits  in  input 

Output 

Pattern 

000 

-> 

0 

— 

0 

001 

— 

1 

— 

1  1 

010 

1 

— 

1 

100 

1 

— 

1 

011 

— 

2 

0 

101 

— 

2 

— 

0 

110 

— 

2 

0 

111 

-* 

3 

i  ! 

Table  3:  Symbolic  mapping  of  the  n-bit  parity  problem  for  the  case  n  =  3. 


With  the  benefit  of  insight  developed  from  the  exclusive-OR  problem,  this  section  ob¬ 
tains  an  exact  representation  of  the  n-bit  parity  problem  as  a  network  based  on  radial 
basis  functions.  The  network  will  have  the  general  form  of  n-inputs  for  an  n-bit  word, 
and  2n  hidden  units  all  connected  to  one  output.  The  centres  of  the  2"  hidden  units  cor¬ 
respond  to  the  possible  input  patterns.  Thus,  an  exact  solution  may  be  obtained  once  the 
2"-dimensional  vector  A  of  weights  has  been  determined.  All  possible  input  states  may  be 
put  in  a  1  :  1  correspondence  with  the  vertices  of  a  unit  n-dimensional  hypercube.  This 
is  conveniently  achieved  by  aligning  the  hypercube  with  the  Cartesian  basis  so  that  one 
vertex  resides  at  the  origin.  The  Cartesian  co-ordinates  of  each  vertex  then  directly  maps 
to  a  unique  binary  sequence,  for  instance  (0,0,0)  — *  000,  (0,1,0)  — *  010,  (1,1,0)  — *  110, 
etc.  The  vertices  may  be  ordered  by  treating  the  set  of  sequences  as  a  cyclic  Grey  code 
of  the  first  2"  integers.  Thus  all  nearest  neighbours  in  this  scheme  correspond  to  points  of 
opposite  parity  and  the  use  of  the  cyclic  Grey  code  ensures  that  entries  across  the  rows  of 
A  represent  points  of  alternating  parity. 

The  rows  of  A  are  permutation®  of  each  other  because  of  the  symmetry  of  the  hypercube. 
It  follows  that  there  is  a  totally  symmetric  eigenvector,  v+  =  2-n/2[l,  1, . .  ,]T  for  which  the 
corresponding  eigenvalue  is  the  sum  of  the  row  elements  of  A 

n 

(51) 

where  p"  is  the  number  of  j-th  nearest  neighbours  to  an  arbitrary  vertex. 

A  second  eigenvector  may  be  found  using  the  division  of  the  vertices  into  two  groups, 
differentiated  by  their  parity.  The  form  of  this  eigenvector  follows  from  the  use  of  the  cyclic 
Grey  code  in  ordering  the  vertices:  v_  =  2“"/2|l, -1, 1, . .  .j'T  This  antisymmetric  form 
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distinguishes  sites  of  opposite  parity  and  thus  has  a  corresponding  eigenvalue 

=  £  py,  -  £  py,  (52) 

j  even  j  odd 

since  there  are  tvtnP™  sites  the  same,  and  H.,  oddP "  sites  w>th  opposite  parity 

The  coefficients  p"  may  be  obtained  by  comparing  two  arbitrary  n-bit  sequences  which 
differ  in  j-locations.  The  number  of  ways  of  permuting  j-locations  within  n-bits  is  just 

p"  =  (j'j'  Note  that  Ho  P;n  =  2",  the  total  number  of  vertices  of  the  hypercube 


Figure  12:  The  hypercube  of  the  3-bit  parity  problem 

The  response  vector,  s  =  [0, 1,0, 1,0, . . .]  may  be  decomposed  as  the  difference  of  the 
symmetric  and  antisymmetric  eigenvectors.  Thus,  v+,v_  are  the  only  eigenvectors  which 
are  relevant  in  evaluating  the  weights.  Consequently,  as  in  the  exclusive-OR  example,  the 
vector  of  weights  of  the  n-bit  parity  problem  may  be  evaluated  as 
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where, 


Ai  — 


~  E;  odd  ^  y  )  $3 


— 


E>  ««n  (j)*} 


(54) 


Equation  (54)  accomplishes  what  the  section  set  out  to  obtain:  an  exact  solution  to  the 
n-bit  parity  problem  in  the  sense  that  there  is  a  guaranteed  set  of  values  for  the  weights  of 
the  matrix  which  ensures  that  the  result  of  the  network  is  to  reproduce,  at  least,  the  values 
exhibited  in  Table  3.  It  should  be  noted  that  although  this  task  has  been  achieved  with 
a  network  involving  2"  hidden  units,  it  is  still  possible  to  solve  the  problem  exactly  with 
fewer  than  this  number  of  hidden  units.  For  instance,  and  by  analogy  with  the  exclusive- 
OR  problem,  if  the  radial  basis  function  and  metric  were  chosen  in  such  a  way  that  either 

(55) 

3  odd  w  7 

are  satisfied,  then  an  exact  solution  may  be  obtained  with  only  2n_1  hidden  units. 
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C  The  quadratic  and  doubling  maps 


This  appendix  lists  a  few  details  relevant  to  the  chaotic  maps  discussed  in  section  6  for  time 
series  prediction.  The  quadratic  map,  as  discussed,  determines  the  signal  at  time  t  4-  1  in 
terms  of  the  signal  at  time  t  by  the  explicit  iteration  scheme 


xt+ 1  =  4x((l  ~  xt)  (56) 

By  a  simple  linear  transformation,  x  — *  y,  this  map  is  transformed  into  the  form 

yn+1  =  2y*  -  1  (57) 

Since  these  two  mappings  are  related  by  a  linear  transformation  (x  =  0.5 [  1  —  y ) ) ,  the 
behaviour  of  equation  (57)  determines  the  behaviour  of  the  quadratic  map.  Equation  (57) 
is  interesting  because  it  illustrates  that  the  mapping  is  explicitly  given  in  terms  of  Chebyshev 
polynomials  [20],  specifically 

yn+i  =  T2(yn)  (58) 


From  this  relationship  and  exploiting  the  property  of  Chebyshev  polynomials  that 


2  TmT„  —  Tn+m  +  Tn_m 


(59) 


or,  specifically 

2  Ti  =  TJn  +  To  (60) 

one  finds  that  the  n-th  iterate,  y„,  is  connected  with  the  starting  value  y0  through  the  2n 
Chebyshev  polynomial, 

yn  =  T2n(y0)  (61) 


This  just  makes  explicit  the  fact  that  the  value  of  the  map  at  any  time  in  the  future 
is  uniquely  determined  by  the  starting  value.  However,  the  map  is  known  to  be  ergodic, 
and  thus  time  averages  are  equivalent  to  phase  averages.  To  obtain  the  phase  average,  one 
needs  the  fact  that  the  invariant  measure  of  the  map,  eqn.  (57),  is 

m (y)  =  '  7==f 

*Vl - y 2 

Therefore,  the  correlation  function  (y*y t+;)  may  be  determined  by  the  expectation  value 


(62) 


(y*y  k+j> 


T2t(yo)T2t+j(yo) 


n\/ 1 


yo 


rfyo 


(63) 


However,  this  is  precisely  the  orthogonality  relationship  between  Chebyshev  polynomials 
of  the  first  kind,  and  hence  the  integral  yields  a  Kroeneker  delta. 


(y*yt+2)  =  -s,.o 


(64) 


Consequently,  as  far  as  second  order  statistics  are  concerned,  the  time  series  generated 
by  the  quadratic  map  totally  loses  its  memory  between  each  time  step,  and  hence  would 


-31- 


D.S.  Broomhead  and  David  Lowe 


appear  to  be  an  infinite  bandwidth,  noise  signal  (this  is  of  course  not  the  case  when  higher 
order  correlations  are  taken  into  account). 

In  the  case  of  the  doubling  map,  the  value  of  the  time  series  at  time  t  +  1  is  determined 
by  the  time  series  at  t  via 

zt+i  =  2xt  mod  1  (65) 

The  correlation  function  may  be  expressed  as 


(xoxt)  =  [  *o|2‘xol  dx 
Jo 


2'-1  . 

=  £/ 

,=0 


(>+»)/« 
/a  i 


x[2‘ij  dx 


(66) 


where  [z]  denotes  the  fractional  part  of  x  (note  that  the  invariant  measure  is  uniform  in 
this  case  and  the  map  is  known  to  be  chaotic  so  that  time  averages  and  ensemble  averages 
are  equivalent).  By  a  change  of  variables,  y  =  2 (x-j  the  above  integral  is  readily  performed 
to  give:- 

<x0*,)  =  2-,,g  {§  +  ^} 

(67) 


Thus,  in  contrast  to  the  quadratic  map,  the  correlation  function  for  the  doubling  map 
decays  exponentially  in  time. 
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